Category: Machine Learning Systems

Batching in LLM Serving Systems
Faster Causal Self Attention
GPU Architecture and Programming
GPU Kernel Programming with Triton and CUDA
How to write a fast kernel
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
Intro to Mixture of Experts (MoE) in LLM Serving Systems
Memory Management in LLM Serving Systems
Modeling and Scaling Performance with Roofline
Optimizing GPU Kernels
Parallelism in LLM Serving Systems
Performance Modeling for LLM Serving Systems
Practical Lessons from Predicting Clicks on Ads at Facebook
Quantization in LLM Serving Systems
Recommender Systems
Sparsity and Pruning in LLM Serving Systems
Speculative Decoding in LLM Serving Systems
Transformer Architecture and Implementation